Creating Electronic Books Using HTML

What is HTML?

HTML is an acronym for HyperText Markup Language. HTML uses tags such as <h1> and </h1> to structure a text document into headings, paragraphs, lists, hypertext links, etc. It is a non-proprietary format based upon SGML, and can be created and processed by a wide range of tools, from simple plain text editors (where you type it in from scratch) to sophisticated WYSIWYG authoring tools. The purpose of this document is to explain how to use HTML to create electronic books (e-books).

The basic structure of HTML is content surrounded by tags. A tag is a keyword enclosed in angle brackets; for example <p>. The element indicated by a start tag, in this example a paragraph, continues until a matching end tag is encountered in the text stream. An end tag uses the same keyword as the start tag, but has a slash character before the keyword. Thus, to end a paragraph you would use </p>.[1]

As you can see, the use of angle brackets to enclose element names causes an important problem: how can we place angle brackets in our text without them being confused as parts of tags? This problem is solved by character references, or entities. Character references are numeric or symbolic names for characters that may be included in an HTML document; they begin with an ampersand (&) and end with a semi-colon (;). Every HTML rendering program must recognize at least three entities: &lt; which represents the '<' character, &gt; which represents the '>' character, and &amp; which represents the '&' character. Entities will be discussed in greater detail later in this document.
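
As a rough illustration of the three required entities (the sentence itself is invented for this example), the following fragment should display the literal text "Use <p> & </p> to mark a paragraph." rather than being read as tags:

<p>Use &lt;p&gt; &amp; &lt;/p&gt; to mark a paragraph.</p>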

HTML elements may have associated properties, called attributes, which may have values. If an attribute has a value (as most do), the value follows the attribute name and is separated from it by an equals sign. Each attribute name and its associated value are separated from the keyword in a tag by a space, and must appear before the final '>' character of the element's start tag. Any number of attributes, separated by spaces, may appear in an element's start tag. Attribute values are usually delimited by double or single quotation marks; you must finish the value with the same mark that began it. Some values may be specified without quotation marks; however, such marks are always acceptable. The World Wide Web Consortium recommends the use of quotation marks even when it is possible to eliminate them.
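
A minimal sketch of the attribute syntax described above; the attribute values here ("en", "intro", "first") are made up purely for illustration:

<p lang="en" id="intro" class='first'>Hello world!</p>

The first two values are delimited by double quotation marks and the third by single quotation marks; each value ends with the same mark that began it.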

Any document can be divided into two major parts: the actual content of the document, and information about the document itself, usually called "meta-data[2]". An HTML document is indicated by the <html> tag set, and is composed of two sub-elements: the meta-data, contained in the <head> element, and the content, contained in the <body> element. The <head> element is optional, but if it exists it must contain at least a single <title> element. Here's an example of a simple HTML document, for which the "lang" attribute has the value "en":

<html lang="en">
   <head>
      <title>My first HTML document</title>
   </head>

   <body>
      <p>Hello world!</p>
      <p>Farewell!</p>
   </body>
</html>

You will notice, even in this simple example, that HTML is used to indicate the structure of a document, but not the manner of its presentation. This simple example has a title, and two short paragraphs. Nowhere is it specified how, where, or even if the title must be displayed. Nor is it specified whether the paragraphs should be indented or by how much, or whether a blank line should be inserted between the two paragraphs. While not obvious here, also unresolved is the decision as to how long a line should be before starting a new line within a paragraph, or what font size and face to use in displaying the text. All these decisions are the responsibility of the User Agent (UA), which is the software being used to display the HTML document. By strictly observing this segregation of structure and presentation, it is possible to create a document which contains all of the information necessary for a UA to recreate a richly formatted document, without dictating the exact nature of the format.

What is an Electronic Book?

The definition of an electronic book (e-book) is fuzzy at best; but then, the definition of a paper book is also fuzzy. Suffice it to say that an electronic book is an electronic version of a book. (In formal logic, this is referred to as a tautology.)

With this in mind, let's start applying HTML to e-books. While the HTML specification is fairly complex, quite satisfactory e-books can be created using only a few simple HTML tags.

The cast of characters

Printers have the luxury of being able to create any character or symbol that can be cast in lead. E-book creators are limited to using those letters and symbols which have numerical representations. Fortunately, this is a very large group.

Originally, ASCII codes were limited to seven bits, or the numbers 0 to 127. Of these, 0 to 31 and 127 were reserved for non-printing "control" codes. The English alphabet, including punctuation and both cases, had to be mapped onto the remaining 95 numbers. If you only used English this was adequate; it was the American Standard Code for Information Interchange, after all, and keyboards only had 96 keys.

When IBM introduced the PC, using an eight-bit architecture, they realized that the extra bit allowed them to specify another 128 characters! IBM used about half of this windfall to create a set of box drawing characters, and the remainder to map some mathematical symbols, a handful of Greek characters, and enough Latin characters to support French, Italian and Spanish. If you were Scandinavian you were out of luck, but who in the computer world would ever need to spell Tørvalds?

The increasing internationalization of the internet caused many organizations, most notably the International Organization for Standardization (ISO), to revisit the character encoding issue. This resulted in the development of the ISO 8859-1 (Latin-1) eight-bit character encoding, and the ISO 10646 (Unicode) 16-bit character encoding, which was adopted for HTML encoding (values between 0 and 255 are identical in the two encodings).

So how are these new encodings embedded into an HTML file which is, after all, just text and is destined to be transmitted using a protocol which only guarantees transmission of 7-bit characters? Well, we already have a mechanism for encoding characters that are reserved by HTML, so why don't we extend that to be able to represent any arbitrary character? This is exactly what HTML does.

As stated earlier, entities are numeric or symbolic names for characters which begin with an ampersand (&) and end with a semi-colon (;). Any Unicode character can be represented by an entity composed of "&#", followed by the numeric value and ending with a semi-colon. For example, the lower case o with a slash through it is represented by "&#248;", and the copyright symbol is "&#169;".

Of course, even though the internet does not guarantee to preserve 8-bit characters, as a practical matter almost all software uses 8-bit encoding. In this case, you can simply include 8-bit characters in your e-book if you have a tool to create them. I have never had a problem with this, but if you choose to do so be aware that you are pushing the limits.

Some common symbols do not fall into the Latin-1 character set and must be specified using their Unicode values. Two of the most common are the "em-dash", "&#8212;", and the trademark symbol, "&#8482;".

In order to give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember a character's numeric value. For example, "&copy;" is presumably easier to remember than "&#169;" and "&oslash;" is easier to remember than "&#248;". A complete list of character entity references can be found at http://www.w3.org/TR/html4/sgml/entities.html. On the other hand, some older UAs do not recognize all character entity references, particularly those which map to values above 255. When in doubt, use the numbers.
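
To make the two notations concrete, here is a small sketch (the sentence is invented for the example) that writes the same characters both ways; a UA that recognizes the symbolic names should render the two paragraphs identically:

<p>&#169; T&#248;rvalds Books&#8482; &#8212; all rights reserved</p>
<p>&copy; T&oslash;rvalds Books&trade; &mdash; all rights reserved</p>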

One last thing. For some reason, ISO did not specify a character mapping for values from 128 to 159 (I suspect it had something to do with the fact that they are the same as the control code values but with the high bit set, but that's just a guess). As usual, there is no gap in the standards that Microsoft does not try to invade, and the Latin-1 character set is no exception. Presumably in an attempt to keep all punctuation in the 8-bit range, Microsoft has created a variation on the Latin-1 set that maps 16-bit characters into the 128-159 range. For example, on Microsoft platforms &#151; is an em-dash, &#147; is a left double quote and &#148; is a right double quote. I mention this only because Microsoft editing tools for HTML will frequently use these illegal values without warning, and they will look correct so long as they are viewed with software running on a Microsoft OS. Beware of this non-conforming behavior if you typically use Microsoft tools.
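
As a sketch of the conforming alternative (the sentence is invented for the example), the same punctuation can be written with its Unicode values, 8212 for the em-dash and 8220 and 8221 for the left and right double quotes, instead of the Microsoft-specific values 151, 147 and 148:

<p>&#8220;Wait&#8212;come back!&#8221; she called.</p>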

Lookin' good!

Within text, authors frequently indicate emphasis or strong emphasis by using italicized or bold fonts. In HTML this is indicated by the use of phrase elements which add structural information to text fragments. A number of these elements are defined in the HTML specification, but only two are commonly used in e-books: <em> and <strong>. As is no doubt obvious, <em> indicates emphasis, and <strong> indicates stronger emphasis. Most UAs display these phrases with italic and bold font faces, but this treatment is not guaranteed.

At times, it may be desirable to change a font style for reasons other than phrasing. For example, in American English, foreign words and phrases are frequently italicized, as are titles of articles and names of starships. In these cases, font style elements are used. Font style elements include:

 

<i> (italic text style)
<b> (bold text style)
<tt> (teletype or monospaced text)
<big> (text in a "large" font)
<small> (text in a "small" font)
<strike> and <s> (strike-through style text), and
<u> (underlined text).

 

According to the HTML 4.0 specification, "Although [font style elements] are not all deprecated, their use is discouraged in favor of style sheets", so these tags should be used sparingly if not avoided completely.

The following HTML fragment demonstrates the use of phrase elements and font style elements:

<p>"That was the <i>Canon in D</i> by Johann Pachelbel," 
he explained.  "Did you like it?"</p>
<p>"Ohh, that was <em>nice</em>", she sighed in reply.</p>

"Breakin' up is hard to do . . ."

When rendering text, UAs are required to "flow" the text, wrapping lines at spaces as required. The corollary to this is that newline characters in text must be ignored when wrapping. UAs must cause a line break short of the right margin at the end of paragraphs, but not until then. Occasionally, however, lines must be broken at specific places. This is accomplished by the <br> element.

When a <br> is specified a UA is required to start a new text line. As is evident, no content is encapsulated by the <br> element, so it is one of a small set of tags called empty elements. These elements are not allowed to have an end tag, so the phrase

<br></br>

is not correct. Unfortunately, XML, which is converging with HTML, does not allow unterminated tags. Instead, XML has a specific syntax for "self-terminating" tags where the tag ends with "/>" instead of ">"; e.g. <br/>. All UAs should recognize the XML-style self-terminating tag; unfortunately some do not. However, the specification is clear that white space must be ignored inside tags, and experience has shown that all known User Agents render self-terminating tags correctly when there is a space between the tag keyword and the terminating slash. For maximum future compatibility a break element should be entered as "<br />".

For example, Harry Potter's address is:

Mr. H. Potter<br />
The Cupboard under the Stairs<br />
4 Privet Drive<br />
Little Whinging<br />
Surrey<br />

The <br> element has the potential of being seriously abused, for example by attempting to create pseudo-paragraphs by putting <br /> at the end of a long line of text. When you use <br />, be sure that there is no other alternative.

Just as there are times when you will want to force a line break, there will be times when you will want to control the location of line breaks. To help with this, the Latin-1 character set provides a no-break space: &nbsp; or &#160;. This is a character which displays just like a regular space, but which is not considered white space, and therefore UAs are not allowed to wrap a line at this point. For example, the trade name "Coca&nbsp;Cola" could be wrapped before "Coca" or after "Cola", but never between them.

As part of the word wrapping algorithm, User Agents are required to collapse consecutive instances of white space into a single space. This behavior can be avoided by using the &nbsp; entity. Frankly, I can see no reason to do this, particularly when using a proportional font, but the capability does exist. Use with extreme caution.

In printed books an ellipsis is generally printed as three periods with a space between each. If you attempt this with HTML, you may end up with displayed text where half (okay, okay, one-third) of the ellipsis is on one line, and the remainder on another. Unicode provides a single character for the ellipsis, &#8230;, and HTML specifies a character entity reference, &hellip;. There are problems with this entity as implemented. First, no part of the entity is considered white space, so the wrapping rules for "Coca&hellip;Cola" would be exactly the same as for "Coca&nbsp;Cola". Second, the dots in the &hellip; entity do not look like periods, so an ellipsis which appears at the end of a complete sentence (such as "That's not the way I would do it. . . .") looks bizarre. My preference is for ellipses which are constructed with periods and &nbsp; entities (i.e. &nbsp;.&nbsp;.&nbsp;.), which solves both of these problems. If you choose to use the &hellip; entity be aware that the character entity reference is not yet well supported; you would be better off using its Unicode value, &#8230;.

User Agents wrap lines only at white space, which is defined as a space (&#32;), a tab (&#9;), a line feed (&#10;), a form feed (&#12;), a carriage return (&#13;), or a zero-width space (&#8203;) (I am not aware of any UA which recognizes this last value). As a result, word wrapping does not necessarily occur where you would usually expect after some punctuation. For example, the phrase "zero-width" will be wrapped before the word "zero" or after the word "width" but never after the hyphen.

How to structure an e-book using HTML

At its most fundamental, a book is a bunch of written words, usually structured into sentences and paragraphs. HTML defines no element for sentences, but does define the paragraph element. The first requirement for an HTML e-book is to indicate all paragraphs with <p> . . . </p> tags.

In many cases, a book is also divided into larger sections such as Chapters and Parts. Until HTML 3.0 there was no way to partition a document into other, arbitrary, divisions. Instead, HTML offers heading elements, <h1> to <h6>, where <h1> is the most important and <h6> is the least important. UAs typically render more important headings in a larger font than less important headings. My experience, with a fairly limited number of UAs, is that headings are all rendered in a bold face font and that <h4> uses a font size identical to the default font. YMMV.

For e-books, there are as yet no standards setting out the usage for each level of header. The emerging consensus, at least in alt.binaries.e-book, seems to be in favor of using <h1> for the displayed title of a book, and <h3> for those part or chapter headings where a paper book would ordinarily begin on a new page.
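
Following that convention, the skeleton of an e-book might look like the sketch below; the book title and chapter text are placeholders invented for the example:

<h1>A Sample Book</h1>

<h3>Chapter One</h3>
<p>The first paragraph of chapter one.</p>
<p>The second paragraph of chapter one.</p>

<h3>Chapter Two</h3>
<p>The first paragraph of chapter two.</p>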

Occasionally authors will desire to break a book into sections visually without any kind of header. Examples include the break inside chapters of novels where focus changes from one character to another, or a horizontal line between the end of a chapter and the end/footnotes for that chapter. These effects can be accomplished within HTML by using the <hr /> (horizontal rule) element or the empty paragraph.

When a <hr /> is specified a UA is required to start a new text line composed of a horizontal rule, or straight line. <hr /> is also an empty element, so for maximum compatibility the horizontal rule should always be written as a self-terminating tag.

Frequently, a blank line is desired as a separator, without the visual impact of a horizontal rule. Usually, your first impression is to use the break <br /> tag, but this idea has some problems, not in the theory, but in the practice.

By default, most web browsers double space between paragraphs. In reality, what they do is create paragraphs that have a half-line margin before and after the paragraph. A single break element gets lost in this situation; to achieve the desired visual separation it is necessary to use two break tags.

But many e-book viewer implementations, as well as HTML documents which have specifically overridden the browsers' default paragraph margins, do not exhibit this same behavior. In these cases, two break tags add too much space. This inconsistent behavior can be avoided by using an empty paragraph, which will reflect the same margins as any other paragraph, growing or shrinking according to the margins applied to paragraphs globally.

When I say an empty paragraph, however, I don't really mean an empty paragraph. Truly empty paragraphs would not really accomplish our goal, as a paragraph with no top and bottom margins, and no content, would disappear (and in any event, truly empty paragraphs are prohibited by the HTML specification). Instead, we need a paragraph that is visually empty, but which contains some content that we can be sure will never be removed.

This looks like a job for . . . no-break space!

The best option to create a visual space between sections is to create a paragraph containing a single no-break space, like this:

<p>&nbsp;</p>

"Poems are made by fools like me . . ."

Word wrapping and the use of proportional fonts can cause problems when rendering text where the visual structure is as meaningful as the words themselves, such as poetry. For these cases HTML furnishes the <pre> element, which tells visual user agents that the enclosed text is "preformatted".

Inside the <pre> element automatic word wrap is disabled, white space (spaces and newline characters) is not altered, and text is rendered with a fixed-pitch font. Because all letters and spaces have the same width, it is possible to begin selected lines at the same column, and because white space is not altered it is possible to specify precisely where each line should end.

The following example shows a preformatted verse from Shelley's poem "To a Skylark":

<pre>
    Higher still and higher
      From the earth thou springest
    Like a cloud of fire;
      The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.
</pre>

When using the <pre> element, you should be aware that the variation in screen width among e-book reading devices is even larger than among most HTML User Agents. In cases where lines of poetry are longer than the physical width of the screen, a rendering device must choose either (1) to provide a left-right scrollbar (this is what most browsers do), or (2) to wrap the text anyway (which is what most PDAs do; according to the spec this is acceptable if undesirable). This situation should be avoided whenever possible.

Where am I . . .?

The major difference between HTML and its predecessor, SGML, is the concept of hyperlinks. Hyperlinks are the threads of the World Wide Web which link one document to another. A hyperlink has two ends: a source anchor and a destination anchor, each of which is indicated by the <a> element. A destination anchor (also called a target) marks a place a link can lead to; a source anchor provides the information needed to find a specific destination.

The concept of hyperlinks may seem irrelevant in the context of e-books, but a target need not be a document on the internet; it could be a file in the same file system as a source document, or even a specific location in the same document as a source anchor. Anchors allow us to build common book features such as indices, footnotes and endnotes, and tables of contents.

The only requirement for a destination anchor is some sort of identifier which is unique within the document; this identifier is added as an attribute of the <a> tag. Originally, HTML used the "name" attribute to indicate the unique identifier for a destination anchor. Recently, the specification has changed to recommend use of the "id" attribute instead. Most known UAs still support the "name" attribute, and many still do not support the "id" attribute. The new Netscape 5.x browsers support the "id" attribute, but not the "name" attribute. Fortunately, both attributes may be used simultaneously so long as the value is the same for both attributes; I recommend using both attributes for all destination anchors.

The destination anchor for Chapter One could be as follows:

<a id="Chap1" name="Chap1"></a><h3>Chapter One</h3>

As you can see, a destination anchor can be (and usually is) empty, so the syntax:

<a id="Chap1" name="Chap1" /><h3>Chapter One</h3>

should also be acceptable. However, I have encountered some User Agents (e.g. Netscape 4.7x) which, for some reason, do not deal well with self-terminating anchor tags. I recommend using an explicit end tag for all anchor elements.

The earlier HTML specifications make no restriction as to the nature of identifiers associated with an anchor tag, except that they must be unique within the document and that they are case-sensitive (although two identifiers may not be used in the same document if they differ only by case). Later versions of the HTML specification indicate that "name" and "id" identifiers must begin with an alphabetic character [A-Za-z] and may only contain letters, numbers, hyphens ("-"), underscores ("_"), colons (":"), and periods ("."). I have discovered that, consistent with the most recent specification, the Microsoft content SDK for MS Reader does not accept anchor identifiers which do not begin with an alphabetic character.

A source anchor uses the "href" attribute to indicate the corresponding destination anchor. The value of the "href" attribute must be a valid Uniform Resource Identifier (URI). URIs typically are composed of a namespace identifier, followed by "://", a network locator, another slash, and a resource ID. In the URI:

http://www.microsoft.com/reader/default.asp

the namespace is "http", the network locator is "www.microsoft.com", and the resource id is "/reader/default.asp". URIs may also contain fragment identifiers which are identifiers separated from the remainder of the URI by a pound sign ('#'). Fragment identifiers can identify a specific destination anchor inside a document, e.g.:

http://www.microsoft.com/reader/index.html#part1

If a namespace is not specified in a URI, the current namespace is used; if a locator is not specified, the current location is used; and if the resource identifier is not specified, the current resource is used. This is important for e-books because in most instances the current resource is exactly the one we want; the fragment identifier is the only thing we are interested in.
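
For instance, a link that relies on all of these defaults needs nothing but the pound sign and the fragment identifier. In this sketch the anchor name "part1" is borrowed from the URI above and the link text is invented:

<a href="#part1">Go to Part One</a>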

The difference between a source anchor and a destination anchor is mostly conceptual; there is no reason that an anchor cannot be both a source and a destination. Indeed, this is how we create endnotes.

An endnote is just like a footnote where the note is placed at the end of the document instead of at the end of the page where the footnote occurs. Because the notion of a page is foreign to the notion of an e-book, we use endnotes instead of footnotes. To create an anchor to an endnote you would do something similar to this:

This is the sentence which needs an endnote.<a id="ret1" name="ret1" href="#fn1">[1]</a>

Then, at the end of the document, you place the explanatory text, with a source anchor that takes you back to where the endnote was referenced:

<a id="fn1" name="fn1" href="#ret1">[1]</a>  This is an aside related to the earlier text.

Clicking on the footnote indicator ("[1]") in the text will take you to the text of the footnote, and clicking on the footnote indicator in the endnote will take you back to where you started.

Endnotes need not appear in the same file as the main text. Simply add a resource identifier (usually a file name) to the URI, e.g.:

This footnote appears in a separate file.<a id="ret1" name="ret1" href="footnotes.html#fn1">[1]</a>

Some e-book readers (e.g. Microsoft Reader) will pop up a separate window when endnote links are contained in a separate file. This is a nice feature which I believe more and more readers will duplicate.

There is yet another way to deal with short footnotes. The HTML specification provides a "title" attribute for anchor tags which "offers advisory information about the element for which it is set." Many web browsers display the title as a "fly-over tool-tip", which is a short message that appears when the pointing device (mouse pointer) pauses over the tag. By placing the footnote text inside the anchor tag as a "title" attribute, the reader can see the footnote by moving the mouse cursor over the anchor tag, without having to jump to a different place in the document. The footnote will disappear when the mouse pointer is moved.

This is a sentence which has a footnote.<a id="ret1" name="ret1" href="#fn1" title="This is the footnote">[1]</a>

Remember, many devices used for reading e-books, such as PDAs, do not have a mouse, so this trick should always be used in conjunction with a footnote destination anchor located somewhere else.

. . . and where am I going?

A table of contents is nothing more than a list of references to particular places inside a document, such as the beginning of a chapter. HTML provides three elements specifically designed to create lists: <ol>, <ul>, and <li>. The <ol> element is used to create ordered lists, the <ul> element is used to create unordered lists, and the <li> element is used to indicate each item in a list. In an ordered list each list item is displayed with its entry number; items in an unordered list are indicated by a "bullet".

HTML lists can be nested, and the numbers or bullets are displayed differently to indicate the level of nesting. Here is an example of a fairly complex table of contents built using lists, for J.R.R. Tolkien's The Lord of the Rings:

<h1>The Lord of the Rings</h1>
<h3><a id="toc" name="toc"></a>Table of Contents</h3>
<ul>
   <li><a href="#Prolog">Prolog</a></li>
   <li><a href="#Notes">Notes on Shire Records</a></li>
   <li><a href="#B1C1">The Fellowship of the Ring</a>
      <ul>
         <li><a href="#B1C1">Book I</a>
            <ol>
               <li><a href="#B1C1">A Long-expected Party</a></li>
               <li><a href="#B1C2">The Shadow of the Past</a></li>
               <li><a href="#B1C3">Three is Company</a></li>
            </ol>
         </li>
         <li><a href="#B2C1">Book II</a>
            <ol>
               <li><a href="#B2C1">Many Meetings</a></li>
               <li><a href="#B2C2">The Council of Elrond</a></li>
               <li><a href="#B2C3">The Ring goes South</a></li>
            </ol>
         </li>
      </ul>
   </li>
   <li><a href="#B3C1">The Two Towers</a>
      <ul>
         <li><a href="#B3C1">Book III</a>
            <ol>
               <li><a href="#B3C1">The Departure of Boromir</a></li>
               <li><a href="#B3C2">The Riders of Rohan</a></li>
               <li><a href="#B3C3">The Uruk-Hai</a></li>
            </ol>
         </li>
         <li><a href="#B4C1">Book IV</a>
            <ol>
               <li><a href="#B4C1">The Taming of Sm&#233;agol</a></li>
               <li><a href="#B4C2">The Passage of the Marshes</a></li>
               <li><a href="#B4C3">The Black Gate is Closed</a></li>
            </ol>
         </li>
      </ul>
   </li>
</ul>

Of course, a destination anchor need not have a source anchor to be useful. It is possible to create a destination anchor for each page number in a paper book. Destination anchors are not displayed, but by their presence it is possible for a UA to go directly to a specific page by simply specifying the page number as a fragment identifier, e.g.:

lotr.html#p271
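
A sketch of what such a page anchor might look like inside the text; the surrounding sentences are placeholders invented for the example, and the identifier begins with a letter so that it remains a legal anchor name:

<p>. . . the last words printed on page 270.
<a id="p271" name="p271"></a>The first words printed on page 271 . . .</p>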

Retaining page numbers for e-books from other sources, or creating them at regular intervals, not only makes it possible to jump to a specific point in a work; the anchors can also be used to create indices[3], and can be helpful in proofreading.

A picture is worth a thousand words

Frequently, books contain images. Images can also be included in HTML-based e-books by using the <img> element. This element is an empty element and so cannot have an end tag, but should be self-terminated. The <img> element has two "required" attributes, "src" and "alt". The "src" attribute specifies the name of the image file to display; the "alt" attribute contains text to be displayed if the image cannot be displayed. The "alt" attribute was optional in HTML version 3.2, and there are currently no UAs which actually require the attribute.

The "align" attribute can be used to move an image to the right or left side of accompanying text. However, I cannot find any way using the "align" attribute to cause an image to be centered. It is possible to center an image using the <center> tag which is now, unfortunately, deprecated.
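
A minimal sketch of these attributes in use; the file name "map.gif" and the alternate text are hypothetical. The first image sits to the left of the accompanying text, and the second is centered with the deprecated <center> element:

<p><img src="map.gif" alt="Map of the Shire" align="left" />
The text of the paragraph flows down the right side of the image.</p>

<center><img src="map.gif" alt="Map of the Shire" /></center>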

A little style

Up until now, we have looked at how to use HTML to structure an e-book grammatically, but we haven't looked at how to structure an e-book visually; that is, how a User Agent should present the document to a reader. This is one of the most controversial issues surrounding the creation of e-books; many think that an e-book should be heavily formatted to look as much as possible like a paper book, and yet a recent survey has shown that control of presentation is one of the top ten most important features in an e-book UA.

Cascading Style Sheets (CSS) are a feature added to HTML in version 3.2 which allow us to satisfy both of these desires.

CSS provides a mechanism to apply any arbitrary style, including a font type, font face or font size, as well as colors, margins and indentation to a class of HTML elements. The class could be all elements of particular type (e.g. <p>) or a subset of a particular type (e.g. <p class="first">).

A style is set for an element class by including the <style>. . .</style> element in the <head>. Between the start and end tags you place all the style rules you want to apply to any particular element's style. Each rule starts with a selector name followed by a list of style properties bracketed by { and }. E-books usually are easier to read in a browser when there are some margins on either side. This style will add margins to each side of the browser window for the entire body of an HTML document:

<style type="text/css">
  body { margin-left: 5%; margin-right: 5%; }
</style>

Each style property starts with the property's name and a colon followed by the value for the property. When there is more than one style property in the rule, they are delimited by a semicolon. In this case, you will notice that the values are expressed as percentages; when you resize the UA's window, the margins will scale with the size of the window. Other valid margin properties include "margin-top", "margin-bottom", and "margin" which applies equally to all four margins.

It is also possible to specify sizes using other methods: ems, pixels and absolute sizes in inches (in), millimeters (mm), points (pt) or picas (pc). A point is equal to 1/72 inch, and a pica is 12 points. An em is an old printer's measure which is defined as the width of a capital M in the type being used (it is also a very handy word to know when playing Scrabble or doing crossword puzzles). In HTML, an em is defined as the font size, which is the maximum height of a line of text.

Absolute sizes are useful only when the properties of the output device are known, and are rarely appropriate when creating e-books. Likewise, pixel measurements should be scrupulously avoided, as text sized in pixels will not scale properly.
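
To make the distinction concrete, here is a small sketch contrasting relative and absolute units. The particular elements and values are only assumptions chosen for illustration:

<style type="text/css">
  /* Relative units scale with the reader's font size and window width */
  h1 { font-size: 2em; }             /* twice the surrounding font size */
  blockquote { margin-left: 10%; }   /* one tenth of the containing width */

  /* An absolute unit, sensible only for a known device such as a printer: */
  /* h1 { font-size: 24pt; } */
</style>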

As has been noted, web browsers do not typically indent the first line of a paragraph, and they make up for the loss of indentation by displaying a blank line between paragraphs. You can override this behavior, and emulate the way paragraphs are traditionally rendered in print with the following style rule:

p { text-indent: 2em; margin-top:0; margin-bottom:0 }

Just as you can specify multiple style properties for a single style rule, you can also specify multiple HTML elements for application of the rule. Typically in books, titles and chapter headings are centered on a page. You can emulate this look in HTML with this style rule:

/* Center first four headers */
h1, h2, h3, h4 {text-align:center }

You will note from this example that the style element can contain comments which use the 'C' programming language syntax; that is, comments begin with the character string "/*" and end with the string "*/".

But what if you want a special style applied only to a few instances of an HTML element? This can be accomplished by the use of the class selector in a style rule. If, for example, you wanted a special style for the first paragraph of every chapter you could declare a style rule:

p.first { /*  the style to be applied to this special paragraph */ }

This style would then be applied only to those paragraphs which begin with the tag:

<p class="first">. . .
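
Putting the two pieces together, a filled-in version might look like the following. The properties chosen for "p.first" are hypothetical; any style properties could appear there. Because the selector "p.first" is more specific than plain "p", its values win for the marked paragraph:

<style type="text/css">
  p { text-indent: 2em; margin-top: 0; margin-bottom: 0 }
  /* Hypothetical choice: the opening paragraph is flush left with space above */
  p.first { text-indent: 0; margin-top: 1em }
</style>

<p class="first">This opening paragraph is set flush left.</p>
<p>This paragraph, like every other ordinary paragraph, is indented.</p>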

De gustibus non disputandum est

Studies have shown that on computer screens sans-serif fonts are easier to read than serif fonts. The same pixelization that makes serif fonts messy also results in rather poor right-hand justification for text. For these reasons, I like to include the following style rule in my e-books:

/*  Indent paragraphs, no space between them, ragged-right text  */
p {
     text-indent: 2em;
     margin:0em;
     text-align: left;
     font-family: sans-serif;
}

As incomprehensible as it may seem, there are individuals who do not agree with me on this issue. One of the primary benefits of style sheets is that all of the style rules are in a single place in the HTML file, so changing them is easy. One of the primary benefits of Cascading Style Sheets is that they . . . well, cascade.

Cascading Style Sheet rules can be specified in a <style> element, but they can also be specified in one or more other files included with an HTML file by using the <link> element, as follows:

<link type="text/css" rel="stylesheet" href="mystyles.css" />

What happens when style rules are specified inside the HTML file and also in one or more external style sheet files? If there is no conflict in the style properties for any particular element or class the properties are merged. If the HTML file contains the style rule

p { text-indent: 2em; margin-top:0; margin-bottom:0 }

and the external style sheet contains the rule

p { text-align: left; font-family: sans-serif; }

the resulting rule for the <p> tag would be:

p {
     text-indent: 2em;
     margin-top:0;
     margin-bottom:0;
     text-align: left;
     font-family: sans-serif;
}

If there is a conflict between rules with the same selector, the last rule wins; so if the internal style rule is

p { text-align: justify }

and the external style rule is

p { text-align: left }

and the <link> element comes after the <style> element in the HTML file, the text in paragraphs would be left justified (ragged right). If the <link> element came before the <style> element, the text in paragraphs would be justified to both the right and left margins.
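
Laid out in a document, the first of those two cases would look something like this sketch. The title is a placeholder, and the contents of "mystyles.css" are assumed to be the external rule shown above:

<head>
<title>Example Book</title>
<style type="text/css">
  p { text-align: justify }
</style>
<!-- mystyles.css is assumed to contain:  p { text-align: left }  -->
<link type="text/css" rel="stylesheet" href="mystyles.css" />
</head>

Because the <link> comes last, paragraphs are set ragged right. Swap the order of the <style> and <link> elements and they would be justified instead.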

If a <link> element specifies an external style sheet which cannot be found, User Agents should not produce any error messages, but should proceed as if there were no additional style rules.

The way style rules cascade in this manner can be a powerful tool for an e-book designer. A designer can mark up the text with class selectors, and create an internal style sheet which creates a presentation s/he finds pleasing. By including a link to an external style sheet after the internal style element, the ultimate reader can override any or all of the designer's decisions without having to edit the original HTML file at all! If no external style sheet is present, the designer's rules (in conjunction with the UA's default behavior) will control the presentation.
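
From the reader's side, the override is nothing more than a plain text file of style rules. A hypothetical "mystyles.css" expressing one reader's preferences (the values are purely illustrative) might contain:

/* mystyles.css: a reader's own preferences (illustrative values) */
body { margin-left: 10%; margin-right: 10%; }
p { font-family: serif; text-align: justify; }

Note that an external style sheet contains only rules and comments; the <style> tags themselves belong in the HTML file, not here.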

<span> and <div>

Perhaps you have a style that should be applied to a short section of text which is smaller than the content of any one element. For this purpose, you use the <span> tag, which is basically a generic tag for classifying sections of text. If you wanted to highlight a portion of your book with a yellow background, like a highlighter, you could do this:

<style type="text/css">
span.Hilite { background: yellow }
</style>

<p>This is a <span class="Hilite">very important concept</span> to remember.</p>

HTML tags are rather arbitrarily divided into groups of inline tags and block-level tags. Inline tags are usually style-related and may only enclose other inline tags, whereas block-level tags are usually structural and may encapsulate both inline tags and other block-level tags. <p> is a block level tag and <em> is an inline tag. Technically, inline tags must be terminated before the end of the block-level element that contains them, although most UAs are fairly lax about enforcing this rule.
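
A short illustration of that nesting rule, using made-up sentences:

<!-- Correct: the inline <em> is closed before its enclosing block-level <p> -->
<p>This is <em>properly nested</em> markup.</p>

<!-- Incorrect: the <em> is still open when the <p> ends -->
<p>This is <em>improperly nested</p> markup.</em>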

The <span> tag is an inline tag. So how do you go about changing the font on a large section of text that may encompass more than one paragraph? The <div> tag is a block-level tag that serves the same function as the <span> tag. Let's say you wanted to display the contents of a hand-written letter in a cursive font. You could accomplish this with the following:

<style type="text/css">
/* use to recreate handwriting */
div.write, div.write p { font-family: cursive }
</style>

<div class="write">
<p>Dear Mother,</p>
<p>I am fine.  How are you?</p>
<p>College is fun.</p>
<p>Please send money.</p>
<p>Your loving son,</p>
<p>Joe Student</p>
</div>

It should be apparent by now that you could create an entire e-book, with all the variety you would want, using exclusively <span> and <div> tags and author-defined classes. This is a very bad idea. As the official CSS specification says, "Authors should avoid this practice since the structural elements of a document language often have recognized and accepted meanings and author-defined classes may not."

Just as there are standard tags with well-accepted meanings, so also should there be standard classes for HTML-based e-books. If everyone used the same classes for the same purposes, and if everyone included an external style sheet with a well-established file name, I could create my own external style sheet with that same well-established file name, place it in the same folder as my HTML-based e-books, and instantly have my preferences used for all my e-books.

There seems to be a movement in this direction already on the internet, and rather than attempting to create a new standard, I recommend leveraging the existing practices. The standard file name for external style sheets is "abebook.css", and the following classes should be used, when needed:

<style type="text/css">
<!--  /* Hide the following from CSS challenged browsers */
/*  <h3> is used for chapter headings. */
/*  Put a little extra space before  chapter breaks, and tell
    UAs to start new chapters on a fresh page  */
h3 { margin-top: 2em; page-break-before: always; }

/*  Class used for the first paragraph of each chapter.  */
p.first {}

/*  The first letter is over-sized . . .  */
p.first:first-letter { font-size: 150% }

/*  and the first line is all caps  */
p.first:first-line { text-transform: uppercase }

/*  Class used to recreate typed letters, newsprint or other
    printed documents. */
/*  <tt> . . . </tt> is probably a better alternative, but this
    covers UAs that have not implemented <tt>, or that have done
    so incorrectly */
div.print, div.print p { font-family: Courier, monospace }

/* Class used to recreate handwriting */
div.write, div.write p { font-family: cursive }
--></style>

After the style element, the following link elements should also be provided:

<link type="text/css" rel="stylesheet" href="/abebook/abebook.css" />
<link type="text/css" rel="stylesheet" href="abebook.css" />

In essence, these links tell the User Agent: "Load and apply the internal style sheet, load and apply the external style sheet found in the 'abebook' folder off the root directory, then load and apply the external style sheet found in the same folder as the HTML document. If the files don't exist, don't worry about it."

Obviously, this brief tutorial is insufficient to do more than provide an introduction to Cascading Style Sheets. While I don't think I have presented any incorrect information here, I have certainly not explored all the power and nuances of Cascading Style Sheets. For more information see Dave Raggett's Introduction to CSS, "Adding a touch of style" at http://www.w3.org/MarkUp/Guide/Style, and the books referenced there. The official specification for Cascading Style Sheets, written in dry techese, can be found at http://www.w3.org/TR/REC-CSS2/.

Whodunnit?

So far we have learned how to present a book in electronic format, but we have only touched on the question of how to include information about the book which is not part of the book itself. This information ranges from the obvious (the book's title, author, date of publication and publisher) to the not so obvious (revision history, critical reviews or disclaimers). Information about information is called metadata, and HTML provides the <meta> element to contain this metadata.

The <meta> element is an empty element where we can place arbitrary data. It can only appear in the <head> section of an HTML document, never in the <body>. It has six possible attributes, three of which are relevant to e-book metadata: "name", "content", and "scheme".

The "name" attribute is like a key word, which indicates the category of the metadata being stored. The "content" attribute contains the metadata itself, which may contain character references or entities. The "scheme" attribute names a scheme to be used to interpret the content's value. The following is an example of metadata for an electronic book:

<title>Johnny Zed</title>
<meta name="author" content="John Gregory Betancourt" />
<meta name="copyright" content="&copy; 1999 by John Betancourt" />
<meta name="publisher" content="Wildside Press" />
<meta name="date" content="121999" scheme="MMYYYY" />
<meta name="id" content="1587150441" scheme="ISBN" />

The HTML specification does not define "a normative set of properties" for any of these attributes. In other words, you can use any value you want for these attributes. For the metadata itself this poses no problem, but wouldn't it be great if we had a standard set of key words that everyone would recognize?

The Dublin Core Metadata Initiative (named after Dublin, Ohio, where the first workshop for the initiative was held, not for Dublin, Ireland) has come up with 15 metadata elements (not to be confused with HTML elements) that the Open eBook Forum has adopted as standard key words for publication metadata: "title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights" (http://www.dublincore.org/documents/dces/). These 15 elements can be qualified by the addition of further information, such as the three-character relator code established by the United States Library of Congress (http://www.loc.gov/marc/relators/). Thus, the author of a work is the "Creator.aut", the illustrator is the "Creator.ill", and the translator is the "Creator.trl". A complete description of how to use Dublin Core Metadata in HTML can be found in RFC 2731, "Encoding Dublin Core Metadata in HTML," at http://www.ietf.org/rfc/rfc2731.txt.

Using Dublin Core Metadata, the above example of metadata would become:

<meta name="DC.title" content="Johnny Zed" />
<meta name="DC.creator.aut" content="John Gregory Betancourt" />
<meta name="DC.rights.copyright" content="&copy; 1999 by John Betancourt" />
<meta name="DC.publisher" content="Wildside Press" />
<meta name="DC.date.available" content="121999" scheme="MMYYYY" />
<meta name="DC.identifier" content="1587150441" scheme="ISBN" />

Indexing using Dublin Core Metadata is accomplished with the "subject" element and an appropriate scheme qualifier such as "LCSH" for the Library of Congress Subject Headings, or "DDC" for the Dewey Decimal Classification. Dublin Core element qualifiers are also non-normative, so if you need a qualifier which has not yet been defined, feel free to make one up!
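
As a sketch of what such indexing entries might look like, following the pattern of the examples above (the subject heading and classification number are placeholder values chosen for illustration, not the actual cataloging of any particular book):

<meta name="DC.subject" content="Science fiction" scheme="LCSH" />
<meta name="DC.subject" content="813.54" scheme="DDC" />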

Unfortunately, the Dublin Core Initiative did not make any provision for versioning or version history in their scheme. For this purpose, I recommend using a <meta> element with the attribute 'name="version"', and including in the content attribute a version number, a date, and a descriptive comment. For example:

<meta name="version" content="1.0 03/03/2001 - original scan" />
<meta name="version" content="1.1 03/15/2001 - conversion to HTML" />
<meta name="version" content="3.0 05/01/2001 - added linked table of contents" />
<meta name="version" content="5.0 07/14/2001 - completed proof-reading" />

Old version information should never be removed; merely add new version information as required.

There remains the question about what to do with large amounts of free-form text which is not easily categorized, and yet which is only tangentially related to the work at hand. This type of information can easily be included in the file as an HTML comment.

A comment in HTML is any text following the character string "<!--" up to the character string "-->". Character entities in comments are not interpreted, as the comment text is never displayed. A disclaimer could be included in an HTML file as follows:

<!--
This is entirely a work of fiction. Any resemblance to any incident or 
individual, living or dead, while perhaps sub-conscious, is purely 
unintentional. So there.
-->

How much metadata should be included in an e-book, if no one is ever going to see it? As much as you think you should. Just remember, it is easy to ignore data which is present, but very difficult to recreate data which is absent.

What to do when things go wrong

<This section will denigrate Micro$oft's (and Adobe's) WYSIWYG html tools, especially the bloated crap produced by Word 2000, and explain how to make good html from bad html>


[1]Early versions of HTML accepted implied end tags. In other words, a structural element indicated by a tag was implicitly closed when another tag was encountered which was inconsistent with an open start tag. For example, when a new paragraph is started by a <p> tag, the current paragraph is implicitly ended. This usage has been deprecated and will not be discussed here.

[2]The HTML ver. 4.01 specification requires a document type declaration at the beginning of every HTML document, naming the document type definition (DTD) which indicates the type of HTML contained in the document, and which entity set to use. This DTD can be thought of as "meta-meta-data". This footnote, therefore, would be "meta-meta-meta-data", and this sentence "meta-meta-meta-meta-data"…

[3]More than one index